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Abstract. Statistical physics approaches can be used to derive accurate predictions 
for the performance of inference methods learning from potentially noisy data, as 
quantified by the learning curve giving average error versus number of training 
examples. We analyse a challenging problem in the area of non-parametric inference 
where an effectively infinite number of parameters has to be learned, specifically 
Gaussian process regression. When the inputs are vertices on a random graph and 
the outputs the noisy function values being fitted, we show that replica techniques can 
be used to obtain exact predictions in the limit of large graphs. The covariance of 
the Gaussian process prior is defined by a random walk kernel, the discrete analogue 
of squared exponential kernels on continuous spaces. Conventionally this kernel 
is normalised only globally, so that the prior variance can differ between vertices; 
as a more principled alternative we consider local normalisation, where the prior 
variance is uniform. The starting point is to represent the average error as the 
derivative of an appropriate partition function. We rewrite this in terms of a graphical 
model, where only neighbour vertices are directly coupled. Treating the average over 
training examples and random graphs in a replica approach then yields learning curve 
predictions for the globally normalised kernel. The results apply generically to all 
random graph ensembles constrained by a fixed but arbitrary degree distribution. 
For the locally normalised kernel, the normalisation constants for the prior have to 
be defined as thermal averages in an unnormalised model. This case is therefore 
technically more difficult and requires the introduction of a second auxiliary set of 
replicas. We compare our predictions with numerically simulated learning curves 
for two paradigmatic graph ensembles: Erdos-Renyi graphs with a Poisson degree 
distribution, and an ensemble with a power law degree distribution. We find excellent 
agreement between our predictions and numerical simulations. We also compare 
our predictions to existing learning curve approximations. These show significant 
deviations; we analyse briefiy where these arise. 
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1. Introduction 

Ever since the seminal work of Seung, Sompolinsky and Tishby [Ij , it has been recognised 
that statistical physics can make significant contributions to the understanding of 
methods and algorithms that learn from examples. The performance of a learning 
algorithm or inference method is captured in the learning curve, which records the 
average prediction error over all training sets of a given size, or generalisation error, as 
a function of the number of training examples. 

Learning curves have been successfully analysed using statistical physics for a 
variety of parametric learning methods (where a finite number of parameters must be 
learned) by taking advantage of the interpretation of the average over training sets as 
quenched disorder [H [21 [3l IH |5l [6] . Rather more challenging is the non-parametric case, 
where the number of parameters is effectively infinite. An important example in this 
class is the learning of functions (regression) with Gaussian process (GP) priors. Here 
the set of parameters to be learned is in essence the entire underlying function that 
one is trying to estimate from the data. GPs have been widely adopted in the machine 
learning community as efficient and fiexible inference methods. This is due to a few 
important advantages, namely that prior assumptions about the function to be learnt 
can be seamlessly encoded in a transparent way, and that inference (at least in the case 
considered in this paper) is relatively straightforward. 

The general regression setting is as follows. We are given a set of data (a;, y) 
consisting of a set of inputs x = (xi, . . . , xn)^ and a corresponding vector of outputs 
y = {yi, . . . , Un)^ and wish to infer the underlying true function y = f{x). In a Bayesian 
approach we infer not a single function /, but a posterior distribution, P{f\x, y), over a 
function space. The posterior distribution is calculated from two elements: a likelihood 
P{y\f, x) which measures the probability the outputs y were generated by a particular 
function /, and a prior distribution P{f) which encodes any prior beliefs about how 
plausible different functions would be in the absence of data. The prior can encode 
properties such as the expected smoothness of the function, its typical amplitude, or 
the lengthscale on which it varies across the input space. The posterior is calculated 
from the prior and likelihood using Bayes' theorem: 



For the particular case of GPs, the prior is assumed to be a Gaussian process [Tj. 
This means that any set of function values f{xi),...,f{xk) is assumed to have a 
joint Gaussian distribution. The statistics of this Gaussian distribution, and hence the 
Gaussian process, are specified by a covariance function or kernel C{x,x') and a mean 
function, /i(x). The kernel gives the covariance {f{x)f{x')) under the prior between 
the values of / at input points x and x'. It is the kernel that gives GPs and other 
kernel based methods their intuitive nature as it describes in a simple manner prior 
assumptions such as smoothness, lengthscale of variation and typical amplitude of the 
function we aim to learn. The mean function n{x) gives the average of f{x) under the 




(1) 
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prior, and is normally set to zero unless one has very strong prior knowledge about a 

non-zero function mean. 

We will consider the simplest case for the likelihood: that data points 
are generated by corrupting /(x^) with normally distributed noise, i.e. 

P{y\f,x) oc exp(— ^^(/(a;^) — ?/^)^)/(2(T^). Then the posterior distribution is again 

a Gaussian process, and the posterior mean and variance of /(x) can be expressed in 

closed form as [7j 

/>*) = k{x*fK-^y (2) 
Var(/(x*)) = C{x\x*) - k{x*fK-^k{x*) (3) 

with the matrix K and vector k having entries /Cjj = C{xi,Xj) + S^a'^ and ki{x*) = 
C{xi, X*) respectively. One would then use f{x*) as the prediction for the function value 
at a test input x*, and the square root of the posterior variance can be used to give an 
error bar for this prediction. 

Analysing the average performance of GPs at predicting a function from data, 
i.e. calculating the learning curve, is an interesting problem that has been well studied for 
the case of functions defined over continuous spaces [H [9l [101 El Il21 HSl HH |15] . However, 
far less is understood about the learning curves for GP regression for functions defined 
on discrete spaces. With the rise of large structured data such as the internet, author 
collaboration networks, protein networks and financial markets, such an understanding 
of the performance of GPs, and machine learning techniques in general, on discrete 
spaces is becoming important. 

In this paper we concern ourselves with obtaining predictions for the learning curves 
of GP regression for functions on large random graphs. We study GPs trying to predict 
a function / : V — )■ M defined on the vertices of a graph G{V, S) with vertex set V and 
edge set S. As is standard in the analysis of GP learning curves we will focus on GPs 
with zero mean. (Our results can easily be generalised to non-zero mean, but at the 
expense of increased notational complexity which we avoid here.) We note that in the 
context of GPs on graphs the covariance function will he a. V x V covariance matrix 
where V = |V| is the number of vertices of the graph. 

We focus on GPs with covariance structure described by a random walk kernel 
first introduced in [16] and further developed in [1^. These are generalisations of the 
frequently used square exponential kernels in continuous spaces (see for example [7]). 
We define the random walk kernel by: 

C = K-^'^ (I-aLYK-'/^ 

where a and p are hyperparameters of the kernel, K = diag(Ki, . . . Ky) is a normaliser 
which controls the prior variance of /, I is the identity matrix, A the adjacency matrix 
of G {aij = 1 if G S, aij = otherwise), D is the diagonal degree matrix with 
Da = Y2j ciij aiid L = I — D^^/"^ AD"^/'^ the normalised Laplacian [18]. (The Laplacian 
defined in [18] differs slightly, in that for single vertices i that are disconnected from 
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the rest of the graph it has La = 0. Our definition has La = 1 for this case, which we 
presume to be consistent with the usage in [T6lll7j. In the case of graphs without single 
disconnected vertices the two definitions of the normahsed Laplacian will agree. We use 
this adjusted normalised Laplacian because analysis is easier without the special case 
and for local normalisation of the kernel, for which we will shortly argue in favour of, 
the resulting normalised covariance function will be the same.) Equation (jl]) is called 
the random walk kernel because it defines the importance of neighbouring vertices in 
predicting a function value at some given vertex i in terms of the frequency with which 
a lazy random walk starting from i lands at the neighbour. We can view as the 
probability of making a 'step' and p the number of attempts the walker makes. With 
this interpretation p/a, the average number of steps of the lazy random walk, can be 
thought of as the lengthscale of (jlj) [19]. 

We will consider two normalisations of the random walk kernel: global normalisation 
and local normalisation. In the case of global normalisation we fix all Ki to be equal, 
chosen such that the average prior variance (1/A^) YlJ=i ^^e desired value which 

we fix to unity. For local normalisation, we set the /tj so that the function value at each 
vertex has the same prior variance, namely Cu = 1. It is worth noting, that from a 
Bayesian modelling point of view, local normalisation is more natural since we do not 
expect to have strong prior knowledge that would justify using different prior variances 
at different vertices. However, we also analyse the case of global normalisation because 
that is the form for the kernel suggested in [161 [H] and used in e.g. |20l HH, [22]. This 
case also serves to illustrate the general structure of the replica approach we use for the 
analysis, before we tackle the rather more challenging scenario of local normalisation. 

Our main aim in this paper is to show that for the setting of GP regression of 
functions on graphs we can calculate the learning curve exactly in the limit of large 
graphs. Our results will be valid for graphs chosen from a broad range of random 
graph ensembles, which are defined by an arbitrary degree distribution. This is rather 
remarkable: in the case of GP regression of functions defined on continuous spaces, exact 
learning curve predictions are possible only in very special cases [23]. Otherwise, one 
has to rely on approximations that typically give only qualitatively accurate predictions, 
and can fail dramatically in some scenarios, e.g. for low noise level cr^. 

The rest of this paper is structured as follows: we begin in section [2] by defining the 
generalisation error and expressing it in terms of a generating partition function. This is 
then rewritten in a form that can act as a suitable starting point for a replica approach, 
using additional variables to obtain a model with interactions only between neighbouring 
vertices. We briefiy compare our method to a similar approach by Opper [24] and 
point out when our method and that of Opper's diverge. In section [3] we use a replica 
approach to calculate the learning curves. We show in section 13.11 that the case of 
global normalisation can be solved by applying a method similar to that used in [25] . In 
section [3^ we proceed to analysing learning curves for the locally normalised kernel. We 
show that the introduction of local normalisation brings a great deal more complexity 
to the problem: we need to use an additional set of replicas to account for the fact that 
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the normalisation is now tied tightly but in a non-local way to the quenched disorder 
given by the graph structure. The statistics of the auxiliary replicas are obtained by 
a saddle point approach that is nested inside the conventional saddle point method 
for the ordinary replicas. Section H] compares our replica predictions to numerically 
simulated learning curves and existing learning curve approximations [HI El], suitably 
extended from continuous to discrete input spaces. We also discuss the qualitative 
differences between the learning curves produced by the globally and locally normalised 
kernels. Finally section [5] summarises our results and discusses potential avenues for 
future research. 

2. Learning curves 

To begin with we express the generalisation error for GP regression on a large random 
graph as the derivative of an appropriate partition function Z. We will introduce 
additional variables to get this into the form of an undirected graphical model. By this 
we mean, following the terminology in the machine learning literature, a Boltzmann 
distribution with a Hamiltonian containing, in addition to local "energy" contributions 
for each vertex, only pairwise interaction energies between neighbouring vertices on the 
graph. By making some assumptions on the graph structure, namely that the graph is 
random subject to an arbitrary specified degree distribution, we will be able to apply 
a replica method to calculate the dependence of the generalisation error on the number 
of training examples, i.e. the learning curve. Physically, the reason that this procedure 
can be carried through is the fact that random graphs from the ensembles considered 
here typically have locally tree-like structures. Accordingly, the results that we find can 
also be derived using a cavity method [26l [27] . 

The generalisation error of a GP is defined as the squared deviation between the 
predicted function, i.e. the student's posterior mean, {f)f\x,y, and the true teacher 
function g that generated the data. This is then averaged over data sets, i.e. noise 
corrupted outputs generated from the true function g for a fixed set of inputs x, and 
independently and identically distributed inputs x^. Finally one averages over the prior 
distribution of teachers g (in this paper assumed to be a GP) and, in our random graph 
setting, the graph ensemble Q: 



Here we have defined / and g to be ^-dimensional vectors with /j = /(xj) and 
9i = g{.Xi). 

In this paper we will assume, as is typical (though see [23]) in learning curve 
calculations, that teacher and student have the same posterior distribution, i.e. a 
matched scenario. The generalisation error in this case is also the Bayes error, i.e. 
the lowest average error achievable for a teacher function from the given GP prior. 
Under this assumption ([5]) simplifies to just the posterior variance of the student (see 
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for example [7]). Since we require only the posterior variance we shift our functions so 
that fi now represents the deviation of the function / from the posterior mean. The 
generalisation error can then be expressed as 

The posterior for the redefined / is proportional to the product of the prior P{f) 
and the quadratic exponential terms from the likelihood P{y\f,x). The latter are 
6^P(~ /^/ (^o"^)) = exp(— Yli=i lifi / where 7^ counts the number of examples 
seen at vertex i. Note that in line with the expression (jS]) for the GP variance not 
depending on the training outputs, the posterior distribution for the shifted function 
values only depends on the training inputs, hence the notation f\x rather than f\x,y 
in (E]). For the same reason we have also been able to drop the averages over y and g. 

We now wish to rewrite ([6]) in terms of a graphical model. Taking a statistical 
physics approach we rewrite the generalisation error in terms of a generating partition 
function Z: 

Z = /d/exp(^4rC-/-^X:^'/^-^/'/)- (8) 

To bring this into a graphical model form we need to untangle the relationship between 
function values at different vertices arising from the prior. We first rewrite the prior 
term as an integral over its Fourier transform in order to eliminate the inverse of the 
covariance function. Carrying out the remaining, now factorised, integral over the fi we 
are left with 

Z = |5|^/2 j ^;,g^p ^_^f^Tch - ^h^Sh^ (9) 

where S = diag (l/(^ + A), . . . , l/(^ + A)) and we have dropped constant factors that 
do not depend on A. U p = 1 we would now have the required graphical model form. For 
larger p, however, the first term in ([9]) still relates more than nearest neighbour vertices. 
We can make headway in dealing with the complicated interactions introduced by C by 
rewriting (jl]) in terms of a binomial expansion. 



g=0 



(10) 



We can now introduce h'^ = [K^^'^D^^/'^ AD'^^'^K^^^'^Y h into ([9]) recursively, by 
setting = h and enforcing h'^ = K^^'^ D^^^'^ AD^^^'^ K^^^'^h'^^^ for g = 1, . . . ,]? using 
a Fourier transformed delta function. Our partition function then becomes 



q=0 q=l 



P P / P 

S\'/^ I f[d/i«J]d/i'?exp -1^0,(0^^"'^"- ^(0^-^^° 



(11) 



q=l q=l 
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with Cq = (a"^)^(l — a~^)^~'^ and where we have again dropped A-independent factors. 

We now have the desired form for Z. Rescahng variables in f lTT]) to (h*)'^ = h^d~^^'^K^^^'^ 
and {h*)j = fifd^ ^^"^ k'J'^ and rewriting exphcitly in terms of vertex and interaction terms 
we have, after dropping the asterisks again, 

(12) 

iev (i,i)G£ 

with 

«(^, .) ^ f t c,m, + ^^^^ - U ± m, - \ log (-^) (13) 

p 

J{h, h') = iJ2 {h\hy-' + {hyh"-'^ (14) 

9=1 

and h = (h^, . . . ,h^,h^, . . . , W)"^ . Equation f|T2|) is now in the form of a (complex- 
valued) Gaussian graphical model on a graph with local fields and interactions only 
between nearest neighbour pairs. 

We can now proceed in the standard way using replicas. Since (logZ)a;,g is hard 
to calculate we use the rephca trick, (YogZ)x,g = \vain^o}i^og{Z''^)x,g, to bring the 
averages inside the logarithm. We calculate (Z")a,,g for integer n and perform an analytic 
continuation to n — )■ at the end. 

Even after introducing replicas, the average of the replicated partition function 
over inputs (and, here, graphs) may be analytically intractable, depending on what 
stage in the calculation averages are carried out. It is here that our method and earlier 
approximations in continuous spaces [8l [12] differ: in the latter approaches, because 
the graph structure could not be exploited, the average over inputs x has to be kept 
in general form. To proceed, a Gaussian variational approximation was used in |12j . 
One can show [26] that for the case of GPs defined on graphs as considered here, this 
variational approximation of the input average in terms of two moments only ignores 
some of fluctuations in the number of examples 7j seen at each vertex. In this paper we 
will take a different approach. We do not attempt to carry out the input average from 
the start; instead we perform the graph average and decouple vertices by introducing 
replica densities. The input average can then be kept without approximation right until 
the final saddle point equations. That this is possible is based on the following fact: if N 
training inputs are chosen randomly from, say, a uniform distribution over vertices, 
then for a large graph {V — )■ oo) this is equivalent to each 7^ being independently sampled 
from a Poisson distribution with mean v where v = N/V. We can therefore carry out 
the input average independently over the different vertices, without approximating the 
distribution of the 7j. This will result in predictions for the learning curve that are exact 
across the entire range from small to large u. For simplicity we restrict ourselves here to 
the case of a uniform input distribution, though all results generalise straightforwardly 
to the case where the input distribution is oJi/V, with the Ui weights of order unity that 
sum to V. 



Z= j d^]^exp f-'H(/ii,(ii,Ki,7i)j J]^ exp -J{hi,hj) 



Learning curves of Gaussian process regression on random graphs 



8 



Finally in order to perform the average over the graph ensemble, we consider 
ensembles consisting of all graphs with a fixed but arbitrary degree distribution p{d). To 
ensure that the graphs are sparsely connected and therefore locally tree-like, we assume 
that the mean degree d is finite, i.e. does not grow with V . With these assumptions we 
are now able to apply the replica method to (|T2l) . 

3. Replica analysis 

Our method for calculating ([7]) from the average of log Z as defined in ( fT2|l for V oo 
follows the approach of [25]. We begin by briefly considering the graph average for the 
class of ensembles of graphs assumed in this paper. 

For a fixed degree sequence {di, . . . , rfy}, a uniform distribution over all graphs G 
with the correct degree sequence is given by 



with Af defined to be the normaliser of P{G\{di . . . dy})- All the results that we 
are interested in are invariant under a relabelling of the graph vertices, and will 
therefore depend only on the degree distribution p{d) = y J2i^di,d- The mean degree is 
d = ^ — YlidP^^)^- '^'^^ factors p{aij) above do not affect P{G\{di . . . dy}) as the 
number of non-zero aij is fixed by the degree constraints, but simplify the calculation. 

We split our replica analysis of the learning curve into two sections corresponding to 
two different ways of normalising the covariance kernel, as specified by the normalisation 
coefficients Kj. In section 13. we consider a globally normalised kernel where Kjii — t\j^ 
a constant equal to the average prior variance of the unnormalised kernel. This fixes 
the average of the prior variances at all vertices to unity. In section 13.21 we consider a 
locally normalised kernel, where is set equal to the local prior variance of vertex i of 
the unnormalised kernel: this forces all local prior variances to be exactly (rather than 
just on average) equal to unity. 

3.1. Global normalisation 

For a large {V — )■ oo) graph, the normalisation by the average prior variance of the 
unnormalised kernel does not depend on the specific graph sampled. Thus we may solve 
( |T3l) with an arbitrary constant = k. In order to calculate the learning curve all that 
is then required is to set initially k = 1, calculate the generalisation error at z/ = and 
set K equal to the result. The generalisation error from the globally normalised kernel 
can then be calculated with this k for any required value of v. 

The learning curve calculation for global normalisation will follow essentially the 
method of [25], so we will keep explanations brief. Our derivation begins by raising ( fT2ll 




P{G\{di . . . dv}) = Yf Ylpi(^ij)^a,j,a,, n ^'^-E. 




(15) 
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to the power n, substituting and including the graph and input averages 



I „ n In 
/ a=l iGV V a=l 



(16) 



X n exp(-5^j7(/i^^^) 



a=l 



We aim to rewrite (1161) in terms of 'replica densities' so that we may calculate the graph 
average and then decouple different vertices. We define these replica densities as 



1 



(17) 



and incorporate them into (|T6ll with a functional delta using conjugate densities p. 
Substituting ( !T7|) and the functional delta enforcement term into f lT6|) . performing the 
graph average and rearranging terms (see Appendix A. 2.1 where di is also defined) we 
obtain 



with 



N 



r'p'Dpexp 



l^(^(Si[p]-l)-i52[p,p] + 53[ 



/It 
W dh^d{h^yp{h\ . . . , h-)p{{h})\ . . . , (/i")') exp 
a=l 

/n 
J]d/iV(/i\...,/i")p(/i\...,/i") 
a=l 

/ /• " / \ fip(/i\ . . . , /l" 

Sz[p] = p(^) log ( / n ^^"^ - E ^) 

d X"^ a=l \ a=l 



(18) 

(19a) 
(196) 

. (19c) 



We see that for ^ — )■ oo, (1181) will be dominated by its saddle point. To proceed we 
must make a specific assumption about the form of the replica densities at this saddle 
point. As in [25j we assume symmetric replicas, making the ansatz 



(20) 
(21) 

(22) 



a=l 

» n 

p{h\...,h-) = -id V^n[^]l[ 



exp I —ip{h°-] 



m 

exp ( —'ip{h°-] 



a=l 



m 



with the convention that 

Z[/] = 1 d/iexp(-/(/i)). 

This amounts to assuming each single replica function has a Gibbsian form. The average 
over TT and vr can be interpreted as representing, physically, an average over vertices i. 
We have included the prefactor —id in (12T|) because this will ensure, in the end, that 
both TT and vr are normalised probability density functions. 
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We now substitute the ansatz (12U]) and ([21]) into fll9al) to (119 cp . and expand in 
the number of rephcas n up to 0{n). Terms involving A first feature in the 0{n) term 
of Sz[f>\- Since the generahsation error ((71) requires a derivative with respect to A we 
must consider both the leading 0(n°) and sub leading 0{n) terms in calculating our 
saddle point contributions. We find that the leading 0{n^) terms cancel with the graph 
normaliser, A/", and constrain tt and tt to be normalised distributions. The subleading 
saddle point equations together with these 0{n^) normalisation constraints then give 



the self-consistency equation (see Appendix A. 2. 2 ) 

p{d)d 



with 



d-l 



. . . , i;''~']{h) = i^\h) + n{h, d, K, 7) 

1=1 

^[ilj]{h) = - log [ dh'exp (-^{h') - J{h, h'] 



(24) 
(25) 

Looking at the explicit form of J and "H, one sees that these equations are solved by 
'energy functions' ip{h) that contain only quadratic terms in h. This of course makes 
sense as we started from the partition function of a (complex) Gaussian graphical model. 
We can then write these functions as ip{h) = ^h^V^^h in terms of their (complex 
valued) covariance matrices V. In terms of the distribution of the V, the self-consistency 
equation (1231) becomes 

, d-l 
Y[dV'7l[V' 



n[V] 



y-v p{d)d 

d 



d 



d-l 



6{V -{O-^XV'X) 



(26) 



with 



= d 



2C1 



1^ 



ic 



2 P 



V 







— 1 



\ 







p,p 



X 



J 







Op+i,p+i 






i 




... 


i 








i 





J 



At first sight fl26l) appears to have a singularity when 7 = and A — )■ 0. Using 



Woodbury's identity (see Appendix A. 2. 3) we can eliminate this apparent divergence 
and solve the self-consistency equation numerically using population dynamics |28j . 
This is an iterative technique where one creates a population of covariance matrices and 
for each iteration updates a random element of the population according to the delta 
function in (|26|) . The update is calculated by sampling from the degree distribution p{d) 
of local degrees, the Poisson distribution of the local number of examples u and from 
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the distribution 7r[V*] of "incoming" covariance matrices, the latter being approximated 
by uniform samphng from the current population. 

Once a numerical solution for vr[V] has been found, the generalisation error can be 
calculated from 

where we have defined S^^'^^ as the 0{n) terms in the small-n expansion of 6*3 (see 
Appendix A. 2. 2, also for the definition of Md), defined eg = (1,0, .. . ,0)"'" and again 



taken advantage of the Woodbury identity. 

Equation (1271) has a simple interpretation: once information from the rest of the 
graph has been folded in, any graph vertex has an 'effective prior variance' with precision 
(ifi:(iVf^^)oo- This is then combined with the 7 local examples to arrive at the final 
posterior variance at this vertex. 

It is interesting to ask what the distribution of cavity covariances irlV] will be for 
different u. Figure 13.11 (left) & (centre) show the log-distribution of Vqo for random 
regular graphs (where each vertex has fixed equal degree) with degree d = 3. For 
u = 0.0001 the distribution appears to consist of delta peaks. This can be confirmed 
analytically by a Taylor expansion of fl2B]) about u = 0. Each peak can be shown to be 
related to the variance of a vertex in a regular tree in which an example has been seen 
at increasing distances from the vertex. One would expect the peak with the highest 
amplitude to correspond to the variance of a vertex with an example seen at a distance 
p, because the number of vertices affected by an example grows with distance from 
example vertex. Further peaks are produced by examples at distances smaller than p, 
with height decreasing as distance and therefore number of affected vertices decreases. 
The Voo values also decrease, since the nearer to an example a vertex is, the more the 
local posterior variance is reduced. For larger u more and more peaks are added until 
the distribution of Vqo becomes effectively a continuous distribution, as seen for z/ = 1. 
Once u becomes very large the GP has effectively learned the target function with very 
little remaining uncertainty, and so the distribution of posterior variances Vqo becomes 
increasingly peaked near zero; this trend can be seen for u = 10. 

Figure 13.11 (right) shows the equivalent plot for Erdos-Renyi random graphs (see 
section H] for a description of this type of graph) with average degree 3. Since graph 
structure is now no longer uniform, one does not see a pattern of delta peaks for small 
u as was the case for regular graphs. There are however visible peaks, around Vqo = 0, 
Voo = 1-15 and Vqo = 1-4, superimposed on a continuous distribution. This is from 
vertices with degree 1, 3 and 2 respectively. To see why, note that the cavity covariance 
matrices V can be interpreted as messages from a vertex that are sent to a neighbour, 
after incorporating the incoming messages from all other neighbours. Since vertices 
with degree 1 have no incoming messages, all messages sent from such vertices will be 
identical, resulting in a large peak in the distribution of Voq. The same reason also 
gives peaks from nodes with higher degree, as long as u is small so that the majority 
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(Left) Normalised histograms of the top left entries Vqo from a population of covariance 
matrices that solves the self-consistency condition f l26|) . for random regular graphs with 
degree c? = 3 and v = 0.01. (Centre) As left but for u = 1 and u = 10, the latter 
just visible as a narrow peak at the left edge of the distribution for u = 1. (Right) 
Analogue of left and centre figure combined for Erdos-Renyi random graphs with 
average connectivity 3. 

of vertices has not yet seen an example (corresponding to 7 = in ( 126|) ). For example, 
the second peak in the figure is from vertices with degree 3 that receive two incoming 
messages from neighbours with degree 1. As these incoming messages are deterministic, 
so is the outgoing message if no local example has been seen. The third peak arises 
similarly from degree 2 vertices with an incoming message from a vertex with degree 1. 

3.2. Local normalisation 

Using the insight gained from the simpler globally normalised kernel, we now derive a 
prediction for the learning curve for the more complicated case of a locally normalised 
kernel. Here is equal to the prior variance at vertex i, as calculated from the 
unnormalised kernel. Although this may seem like a trivial change to the normalisation, 
it requires that properties of the quenched disorder are effectively defined via a second 
thermal average, making the calculation rather more complicated. The change from 
global to local normalisation can also be shown to cause significant changes to the GP 
prior over functions, and in turn to the learning curves |26] . 

We begin our calculation by including the site-dependent normalisation constants 
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Ki in (IT^ via a delta enforcement term 

X Jj5 (^Ki - ^ y d/aux(/aux,i)^exp (^-^/aTixCo" Vaux^ ^ • 

Here Zaux is defined to be the normaliser for an auxiliary GP /aux, and Co = (/ — aLY 
is the unnormalised kernel from (jl]). 

Because the definitions of the Ki depend on the entire graph in a non-local way, the 
partition function f l28|) no longer has the structure of a graphical model. Before we can 
apply the replica method, we need to bring it back into this form. This will require the 
introduction of an additional set of replicas, for the auxiliary GP. 

Focussing on the Kj-enforcement terms, we rewrite the delta functions in terms of 
their Fourier transform by introducing a conjugate variable ki, and replace the integral 
over /aux in the exponent with an empirical average over L — oo replicas of /aux- With 
an element of foresight we also introduce Aaux — )■ so that later equations are well 
behaved for ki = 0. The delta enforcement terms from (!28l) are then given by 

1- /n§ri(^)expfe 



Aaux^O igV 1 = 1 ^ '^"-^ ^ \ieV 



If 2\ki 



(^+Aaux)x:(/^ 



I \2 
aux,i, 

1=1 



(29) 



xexpl-^> Y/:jT^o''/aux 



The non-local coupling via Cq^ can be reduced to nearest neighbour interactions 
as for the original GP variables /. In order to get ( l28ll into the form of a graphical 
model it then remains to deal with the graph dependent Zaux term. We bring Zaux 
into the numerator by introducing m — 1 replicas of Zaux and taking m — )■ so that 
Z~^^ = limm^Q Z!^~^ . As is conventional in replica calculations, we then exchange 
the limits m — ?■ and L — oo, taking the latter first. The L-replica average over 
(/aux.j)^ = (/iux.i)^ then be symmetrised to an Lm-replica average over the (/aux,i)^ 
with b = 1, . . . , m. This is because we have an infinite number mL of replicas at any 
fixed m > so that both averages become non-fluctuating. At this point the only effect 
of m is via the total number of replicas mL, and to simplify the notation we can fix 
m = 1. At the end of the calculation of (logZ) we then in principle have to replace 
L by mL and take m — )■ 0. However, as for the generalisation error we only need the 
A-derivative, which does not directly depend on the auxiliary degrees of freedom, this 
will not be necessary. 

With these simplifications, and after also rescaling ki by a factor of L for later 
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convenience, the delta function constraints for the k,- take the form 



hm 

L—^oo 



1=1 Kiev L 1=1 \ J 



(30) 



Finally rewriting (130 p using the same techniques as we did for Z in section [21 by 
Fourier transforming the covariance, integrating out the remaining terms, introducing 
2p additional variables at each site, rescaling according to {h*)l^^- = h^^^^d^ and 

A ^ -I Irx 

(^*)aux,j = ^lux,A dropping the asterisks again, we arrive at the desired graphical 

model form from the partition function: 

- V ^ 



^ ^ 1=1 {i,j)€£ V 1=1 

X JJ exp I -U{hi, di, Ki, ji) + ilKiki - ^ 'H^n^U^^^^i, di, ki 
iev \ 1=1 

Here we have defined 

/T-auxV-aux) fi^J — „ / '-g"'aux"'aux ^ r, \ , o- - / ^ "'aux"'aux 

2 ^ 2 Aaux + 2lK ^ 



(31) 



(32) 



2 V Aaux + 2iK^ 

With fl28l) now in a form suitable for application of the replica method we proceed 
initially as we did in the case of global normalisation in section 13.11 We bring the graph 
and input average inside the logarithm using the replica trick at the cost of n additional 
replicas. The resulting average of reads 

■a dK^K^L^ ^ ,7Ua 



^^^^-'^ = iT^ ( / n n d^au^ n -p - E ^(^^ 

a=l \ ^ > 1=1 J iGV \ a=l 

n n L \ 

+iL E <f^t ^aux(/iL\., d., kt) (33) 

a=l a=l 1=1 / 

(n n L 

- J2 Jihl h^) ^('^aux,^' '^aux.) 

a=l a=l 1=1 

Similarly to the global case we introduce rephca densities 

p{h}, . . . , h,", /iaux5 ■ ■ ■ 1 ^aux5 Z^^, • • • , K^, . . . , k") 

1 - " 

= 7 E ^'"^ n ^(^'^ - - - '^") 

i a=l 

n L 
a=l 1=1 



D,g 



(34) 



Learning curves of Gaussian process regression on random graphs 



15 



with enforcement via conjugate densities p. We substitute the densities into ( 15^ and 
perform the average over the graph ensemble to derive, using techniques similar to those 
for the global case (see Appendix A.2.1 ), 



VpVpexp V 



d 



(5i[p]-l)-i52[p,p]+53 



(35) 



where 
Slip] = 



a=l \ 



~ ~ dn'^dk^L din")' dik")' L 



27r 



271 



nd^auxd(/^al)' P(---)p(---0 



1=1 



n L 



X 



a=l 



S2[p,f>\ 



a=l 1=1 

p(... )«...) 



1=1 



a=l 



La 
aux 



SM = E P(^) ( / n I d/i'^^^^^ n d/iit ) exp ( - Y.n^"^ 7) 



a=l 



1=1 



n L 



a=l 
d ' 



+iL ^ -EE ^-x(/*il, «:") 



a=l 



a=l 1=1 



Mi_-)) 



(36 a) 



(36 &) 



(36 c) 



We have not written the limits Aaux ~^ and L ^ oo explicitly here. 

As before, the expression fl35l) for the average of the replicated partition function 
will be dominated by its saddle point for large V. We assume that this saddle point has 
a replica symmetric form and set 



Pi- 



X 



n 



exp ( -il){h'') - V'«(/t", k") j ^ exp ( -ip^nK{h[ 



.l,a ^ 
aux/ 



(37) 



a=l 



1=1 



p (...) = -irf / VipVip^V^^^^Tiltp,^^,^^^^] 

" exp ( -V^(h,") - tp^in'', k") ) a. exp ( -V'aux(^: 



n 

a=l 



]zii-^] 



n 

1=1 



l,a ' 
aux; 



(3^ 



Z[iP, 



Any dependences between h, /laux, ^ and k are encoded in the distributions vr and tt. 
Given that the latter have the interpretation of distributions across vertices, one would 
expect that ipi^{K,°', k"-) ends up constraining to a definite value, with tt then expressing 
how the local distributions of /i" and h}^'^^ are correlated with this value of Kq. This is 
indeed what we will find. 

Substituting fl37j) and fl38l) into (136 ap to (136 cp . we can write each of the exponent 
functions as expansions in n and nL, setting Si = Sf^^^ + nSf^"'^ + nLSf^""^^ up to 
linear order in n. As before, and as is standard in replica calculations, the subleading 
0{n) terms are required in order to calculate the generalisation error. The novel feature 
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of the present calculation is the presence of the 0{nL) term arising from the auxiliary 
replicas. 

The expansions for 5*1 and 5*2 are calculated in a similar manner to the global case 
and yield 

( ~ ~ ~ ~ \ (39) 

/• - - exp f-J(/laux,^Lux) -^aux(^aux) -^aux(^Lux)^ 

log / d^auxd^aux — T^TT^l 

J ^[^aux]^[^aux] 
aux 

exp (-^aux(/iaux) " ^aux(/iaux) 

X log / dh,«„x ■ 



(40) 



^[^aux]^[V^aux] 

for 0{nL), while for 0{n) 



X log 



exp (-J{h, h') - i){h) - ^'(^') 
d/id^' ^ , , , , ^ (41) 



zmzm 

exp (^-ij{h) - ij{h) - ip^{K, k) - ^^(k, k 



X log / dhdndk- 



(42) 



The expansion for 6*3, on the other hand, is more complicated (see Appendix A.3.1 ). 
We extract 0(1) terms using the same technique as for the globally normalised kernel, 
leaving a logarithm containing both 0{n) and 0{nL) terms. Collecting the exp(iLKfi;) 
term and the integral over h-aux we can rewrite the integrals over k and k in a form that 
can be evaluated by a saddle point method for large L. The saddle point equations fix 
fi; = and K = 1/ Aaux - dv[^l^^, V'ixl/^aux in ^3 with 



/ d/laux(/iaux)^ exp ( -"H 

aux V ' ""aux > d^^) - Z]i=l^aux(^au: 

^[^aux, • • • , <x] = Z Z (43) 

J d/lauxexp ( -'Haux(/iaux, d, 0) - Y.i=l V'Lx(^aux) ) 

With K and k determined by the large L-limit in this way, breaking 5*3 [p] up into 0{n) 
and 0{nL) terms becomes trivial and gives 

/d 
n^^V^^^^^;^^^Lx^[V^\^«,^Lx] 
A 1 



X log / dh 



( ~ d ■ ~ \ (44) 

exp (^-■Haux(^aux, d, 0) - ^-^.^ ^aux(^au. 
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S'. 



0(n)r 



i=l 



da ^ 



X log 



X exp —V, h, d, 



>? 



,0) 



(45) 



A. 




Here we have omitted a normalising factor that will be fixed to 1 by the leading order 
saddle point conditions. With 5*1, 5*2 and 5*3 in the desired expanded form we are now 
ready to find the generalisation error tg by a saddle point evaluation of (!35|) . We find 
as before that the 0(1) saddle point conditions constrain vr and vr to be normalised 
densities. Enforcing these normalisation constraints using Lagrange multipliers in the 
higher order saddle point conditions, we are then able to derive self-consistency equations 
for TT and tt. 

The 0{nL) saddle point conditions give self-consistency conditions relating to the 
auxiliary variables. These have the same form as fl23p for the original variables for the 
globally normalised kernel case, but with k, = 1 and 7^ = 0. Assuming once again 
a complex Gaussian solution with ■j/'aux = I'^JuxKiui'^aux, we arrive at self-consistency 
equations that can be solved with population dynamics. These present an apparent 
singularity for Aaux — ^ 0, but using the Woodbury identity this can be removed. 
Defining Maux,d in a similar manner to Md but for the auxiliary variables we may 
rewrite the apparently singular combination that gives the saddle point value of k (see 



section Appendix A.3.1 ) as 

1 dvlijl^ 



d ^ 

auxj 



1 



(46) 



Aaux Aaux Aaux + de^ M^^^^^Bq 

and then set Aaux = 0. 

Finally we consider the 0{n) terms for the non-auxiliary variables, which contain 
the A-dependence that we need to find the generalisation error. The relevant saddle 
point conditions are 



d 



J 5 aux 5 



5 aux 5 ai 



(47) 



x5(^/..-T[t[^i],...,t[^fi] 



where 



d-l 



5 aux' • • • ; 



= J2^'{h)+n(h,d, {de^M-l,eo)-\i 



1=1 



(4^ 



<l>[^](/i) = - log / dh'exp [-J{h,h')-^{h') 



Learning curves of Gaussian process regression on random graphs 



18 



d-l 



1=1 



t[V^K](/t, k) = -log / du'dk' exp{-tp^{K,k')) . 



(49) 
(50) 



Taking advantage of (1501) we can fix ip,^ to be constant. Using this in (H9ll fixes ipf^ to 
be K and ^-independent also. With both ipf^ and tp^ fixed we may remove them from vr 
and if. The remaining equation for the distribution of can then be solved again using 
complex quadratic forms. Setting ip = ^h^V^^h we have the self-consistency equation 
for the distribution of the V, 



dp{d) I 
d \ 



d-\ 



■t=i 



;(5 [v - (6 -Y^XV'X)-^ 



(51) 



with 



= d 





Ir 
■ 2^P 


... 




— i 






— i 


-i 









— i 


Op,p 



(52) 



Once more we may take advantage of the Woodbury identity to deal with the apparent 
A — )• singularity when 7 = (see Appendix A.2.3 ), and can then use population 
dynamics to numerically solve (ISTll . 

Finally, once we have obtained 7r[\4ux] and vr[V|\4ux], the generalisation error can 
be calculated by taking the A derivative of S^^"'\ Utilising the Woodbury identity one 
last time so that we may take the A — )■ limit gives the expression 

1 \ 

(53) 



/a^ + (ejM-; .eo)-ne?M-ieo) 



with Md defined in a similar manner to the global case (see Appendix A.2.3). The 
interpretation is also similar: information from the rest of the graph provides an effective 
prior variance at any given node; the new element here is that this normalised by the 
prior variance in the absence of data, computed from the auxiliary variables. 



4. Results 



Using ( |26|) . (!27|) . (ISTj) and (!53|) . we can now predict the learning curve for GP regression 
with a random walk covariance function on random graphs from ensembles specified 
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by arbitrary fixed degree distributions. We compare our replica-derived predictions 
for botli globally and locally normalised kernels against a graph ensemble equivalent 
of the approximation presented in [8], and numerically simulated learning curves on 
graphs with 500 vertices, for a range of noise levels. In what follows, we will fix the 
hyperparameters a and p to be 2 and 10 respectively, but note that qualitatively similar 
results are obtained for other values of these hyperparameters. 

In figured] we compare learning curve predictions for GP regression with a globally 
normalised kernel on two types of random graph ensembles. Figure [1] (left) shows results 
for the Erdos-Renyi [29] ensemble. Here each edge in the graph is present independently 
with some fixed probability of order This leads to a Poisson form for the degree 

distribution, p\{d) = X'^e~^/d\; we choose the average degree A to be 3. As can be 
seen the replica predictions greatly outperform the approximation from [8], especially 
in the mid section of the learning curve where fluctuations in the number of examples 
at each vertex have the greatest effect (see section [2] for discussion on why this is to 
be expected). In fact the replica theory accurately predicts the numerically simulated 
learning curve along its entire length. Figured] (right) similarly assesses our predictions 
for a graph ensemble with a power law distribution of degrees. This generalised random 
graph ensemble [30] is generated by a superposition of Poisson degree distributions with 
different means, p{d) = J dXp{X)px{d). The distribution of the means A is set to be a 
Pareto or power law distribution, p{X) = aA^/A""^^ for A > A™ with the lower cut off 
Am set to be 2 and the exponent a chosen as 2.5. Once more we see that the replica 
method faithfully predicts the numerically simulated learning curves along their entire 
length. Comparison between the graph ensemble equivalent of [^ (not shown in figure d] 
(right)) and numerically simulated learning curves again shows that this prediction fails 
to accurately predict the midsection of the learning curve. 

Figure |2] (left & right) presents numerically simulated learning curves and 
theoretical predictions for a locally normalised kernel, again on Erdos-Renyi and 
generalised random graph ensembles with the same parameter settings as for the globally 
normalised kernel. Once more we see that our replica-derived learning curve prediction 
is accurate along the whole length of the curve. The graph equivalent of |8], however, 
again fails to accurately predict the mid-section of the leaning curve. We show in the 
inset of figure |2] (left) a more detailed plot of the somewhat unexpected shoulder that 
appears in the learning curves for the locally normalised kernel. This inset shows that 
the shoulder is due to the more realistic normalisation of disconnected vertices. Around 
the shoulder one can show that learning curves are dominated by mean-square errors of 
predictions on single disconnected vertices of the graph [26] . In contrast to the globally 
normalised case the variance at these vertices is fixed to 1 until an example is seen. Once 
on average one example has been seen at each disconnected vertex (i.e. around u = 1), 
the effect of these vertices becomes subdominant and the learning curves decay in a 
similar manner as for the globally normalised kernel. For a more detailed discussion of 
differences arising from the choice of global or local normalisation for the random walk 
kernel we refer to l26l. 
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z/ = NIV V = N/V 

Figure 1. (Left) Learning curves for GP regression with a globally normalised kernel 
with p — 10 and a = 2 on Erdos-Renyi random graphs, for a range of noise levels. 
Solid lines with filled circles: numerically simulated learning curves for graphs of size 
V = 500, solid lines with triangles: replica predictions (see Section [XTt . dashed lines: 
The graph equivalent of W- (Right) Analogue of left figure for power law random 
graphs. 




z/ = N/V V = N/V 

Figure 2. (Left) Learning curves for GP regression with a locally normalised kernel 
with p — 10 and a = 2 on Erdos-Renyi random graphs, for a range of noise levels. 
Solid lines with filled circles: numerically simulated learning curves for graphs of size 
V = 500, solid lines with triangles: replica predictions (see Section [3T2|) . dashed lines: 
The graph equivalent of [8|. (Left inset) Dashed lines show the contribution from 
disconnected single vertices to the learning curve (solid line with filled circles). (Right) 
Analogue of left figure for power law random graphs. 
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5. Conclusions 

In this paper we focussed on predicting learning curves for GP regression with random 
walk kernels of functions defined on large random graphs. Using the replica method we 
derived accurate predictions for two normalisations of the kernel: global normalisation, 
where only the average of the prior variance across all vertices is fixed, and local 
normalisation, where the prior variance is uniform across all vertices. The random 
graph ensembles we considered are specified by a fixed but arbitrary degree distribution, 
making the results applicable to a broad range of graphs. We assume that the average 
degree is finite, which implies that the graphs are locally tree-like. This is, in the end, 
why the replica approach can give exact results for this problem in the limit of large 
graphs. For simplicity we also assumed that training inputs are drawn from a uniform 
distribution across all vertices, though the results generalise straightforwardly to more 
general input distributions. 

We began in section |2] by rewriting the generalisation error (jS]) in terms of the 
partition function of a complex- valued graphical model (1T2|) . In section |3TT] we calculated 
the disorder average of the log partition function, for the case of a globally normalised 
kernel, applying a replica method similar to that of [25] . 

Section 13.21 then dealt with the case of a locally normalised kernel. Here the local 
normalisation constants are determined indirectly as the local prior variances of the 
unnormalised GP, and this non-local dependence on the quenched disorder (i.e. the graph 
structure) renders the analysis rather more complicated. We used a delta enforcement 
term to fix the local normalisers (equation fl28|) ): rewriting this in its Fourier form 
and introducing an auxiliary set of replicas we were once again able to cast the relevant 
partition function in the form of a graphical model. The rest of the analysis then required 
an additional saddle point method inside the usual saddle calculation, to evaluate the 
'action' governing the statistics of the replica densities. 

Finally, in section H] we compared our new predictions to an earlier approximation 
by Solhch [8], suitably extended to the graph case, and numerically simulated learning 
curves. Figure [T]for globally normalised kernels showed that our predictions are accurate 
along the whole length of the learning curve for both Erdos-Renyi and power law 
generalised random graph ensembles and for a range of noise levels. In figure |2] we 
showed similar results for the locally normalised case. We also briefiy discussed the 
change in the shape of the learning curve that is caused by a change in normalisation 
from global to local. 

In the future, it would be interesting to extend our approach to a larger class of 
graph ensembles, which embody graph structure beyond degree distributions. One could 
do this by applying the techniques in [31] where the replica method has been used to 
calculate eigenvalue spectra for graphs with topological constraints and 'generalised' 
degree distributions. The community structures studied in [31] could be further 
generalised by using the more general community graphs considered in [32], where the 
authors consider an adjacency matrix with block form and a sparse connectivity pattern 
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between the blocks. 

An open problem is whether and how one could apply the methods from this paper 
to the significantly more complicated case of model mismatch, where teacher and student 
have different posterior distributions. In this case one would no longer be able to simplify 
the generalisation error to the posterior variance as was done in section Ej so that 
explicit averages over both teacher and student posteriors would need to be performed 
analytically. Preliminary work has shown that this would entail the introduction of an 
additional two local site variables and a second set of replicas. A novel extension of 
mismatch for the graph case would be to consider graph mismatch between student and 
teacher, where the underlying graph structure assumed by the student is a 'corrupted' 
version of the graph used by the teacher. Finally, it would be interesting to study how 
learning curves of real world data compare to the average case learning curves considered 
in this paper, for suitably chosen graph ensembles. 
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Appendix A. Details of replica analysis 

Appendix A.l. Graph normaliser 

In this section we derive the large V asymptotics of the graph ensemble normaliser A/" in 
( |T5|) . This is given as a sum over all possible graphs G; i.e. over all adjacency matrices 
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where going from f lA.2p to ( ]A.3|) we have assumed that is large so that exp(-p:) ^ 1 + 
Defining the density po = with a conjugate enforcement term po, flA.3P 

becomes, 
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(A.4) 



Rewriting the final exponential in terms of its power series and performing the 
integration over di gives 



U =V j dpodpo exp ( V 



n 



(ipo)' 



(A.5) 



Defining the degree distribution p{d) = (l/V) J2i^d,,d as usual and bringing the final 
term into the exponent of the first gives us a form amenable to saddle point evaluation: 



J\f =V dpodpo exp V 
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The saddle point of the exponent occurs when 

= dpo- ipo 
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(A.6) 



(A.7) 
(A.8) 



and solving gives po = 1 and po = —id. 



Appendix A.2. Global normalisation 

We detail in this section some of derivations of the equations in section 13.11 related to 
the learning curve for globally normalised kernels. 
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Appendix A. 2.1. Deriving equation / f7^) from [W\) : Performing the graph average as 
in 



Appendix A.l , f|T6|) becomes, 



/ n 2^ n n I ~ 5Z "^(^i*' ~ '^^^^^ 

■^0 i a=l i \ a=l 



X 



exp - ^ h]) + i{di + d,) - 1 



a=l 
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Using the rephca densities defined in ( IT7|) we see that the final product in (lA.OP can 
be rewritten in terms of fllQap by moving the product over i,j into the exponent and 
reahsing that summing over all and h°: is the same as integrating over the replica 
densities. This gives the term exp(^(S'i — 1)) in f fTSj) . 

In order to use f lTTl) in flTB]) we must introduce a delta enforcement term. 5*2, defined 
in ( llQ^p . is a direct consequence of including the functional delta term, 
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Finally 5*3, defined in f llQcp . is derived from the first product in (]A.9|) with the remaining 
term from (lA.lOp . These factors give 

V a=l i \ a=l / I X 

where /)(. . .j) = f){h]., . . . , /i"). Bringing the integration over h into the exponent by 
using a logarithm and performing the same procedure as in Appendix A.l in going 
from flA.4p to flA.Sp . gives 
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Since we integrate over each site independently and also the data average separates into 
independent Poisson averages over the number of examples 7^ at each vertex, we can 
introduce p{d) = ^di,d and replace the integral over h," with an integral over h,". 

This yields the term exp(l^S'3) in (|T8|) . 



Appendix A. 2. 2. Deriving equation ^Mj from [T^) : Substituting ([20]) & ([2T]) into (I19al) 
to (I19cl) we have 
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We now expand to order n, using = 1 + nlog(x) + . . . This gives 
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exp y—'ip{h) — 'ip{h] 



^^[p\ = ^Pi(^) log 



d\ 



n 



exp ( —^\h] 



1=1 



-1 
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(A.14) 



(A.15) 



(A.16) 



(A.17) 



+nY,P{d) I \{vi,%i,'] /log I d^exp [-n{h,d, k,^)) (a.18) 



The leading 0(1) contributions yield as saddle point conditions for n and vr simply 
J Vipnlip] = 1 and J VipTtltp] = 1. With these normalisation constraints inserted, the 



0(1) terms take the same form as in Appendix A.l Thus for V ^ oo the 0(1) terms 
in the exponent of ( !T8|) will cancel, and we are left as expected with the A-dependent 
subleading 0(n) terms. These lead to the saddle point equations for 7i[ip] and vr [■?/'] 



I^ = d Vi/j'ttI^/j'] log / dhdh' 



-d / V%l)fc[ijj] log / dh 



exp (^-^{h) - f{h') - J{h, h'] 

mzW] 

exp {-^l){h) - ^{h] 



(A.19) 



zmz[^] 
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„ d~l 

d •' i=l 



X ( log / dh 



exp ( -H^h, d, K, 7) - ip{h) - ^f^l 

zWnLzW] 




(A.20) 



_exp {-ilj{h) -ipih] 
-d I VijTc[ip] log / dh— 



mm 

where the 0(1) constraints have been enforced with Lagrange multipliers /i and fi. 
Looking at (lA.lQp . the Z[ip'] and Z[iIj] factors give additive constants that can be 
absorbed, along with the factors of d, into a redefinition of fi. Bearing in mind the 
definition f l25|) of ^, the resulting equation has the form 



fx 



(A.21) 



in which F{-ip, ijj) = log J dh Z ^[tp] exp y—ijj{h) — i'{h)j . As this equation must hold 
for arbitrary ip, it follows that the distribution of must equal the distribution of 



hence 



(A.22) 



(To be precise, ip and "^[ip'] can differ by an arbitrary additive constant because 
F^ip, ip + const) = F^ip, ip) + const; we fix this constant to the most convenient value in 
writing (IA.22p .) Similarly one derives from (IA.20p that 

, d~l 

nm=y -—^{ I \\Vip'm'm-W\...,i!''-'])) (A.23) 



i=l 



with \& defined in ( 124|) . Combining this with (1A.22P gives rise to (1231) . 



Appendix A. 2. 3. Woodbury derivations The apparent singularity in the update 
equations given by f l26|) can be eliminated by using the Woodbury identity for inverting 
matrices [3^. We may write 

-1 Ti/T-l Tn/r-1 



eo 



dn 



'y/a^ + A 



{^/a^ + X)/id^) + e^M,\eo 



(A.24) 



where Md- 



eo ^/^2_^x ^0 contains only A-independent terms. On 



the right hand side of flA.24p we can now take A — )■ without problems, even when 
7 = 0. 

A similar method may be applied to arrive at ( 1271) . Direct differentiation with 



respect to A of 5*^^"^ defined as the coefficient multiplying n in flA.lSP will result in the 
following expression for the generalisation error: 



lim > p(d)d 



-f/a^ + A 



dn 
-f/a^ + A 



e^{0-J2xV'X)-'eo 



(A.25) 
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If we now define in analogy to Md-i above, but with the sum over i running from 
1 to (i, we can substitute the analogue of flA.24|) into flA.25|) and take A — ?■ to arrive 

at (EZD. 

Appendix A. 3. Local normalisation 

We detail in this section some of the derivations of equations in section 13.21 related to 
the locally normalised kernel case. 

Appendix A. 3.1. Deriving equations ( f^^P ^ (4^) from equation Ii36c\) : We proceed as 
we did for the globally normalised kernel, substituting fl37|) and f l38|) into (136 cp and 
taking advantage of = 1 + nlog(x) + . . . After extracting the 0(1) terms we are left 
with 



1=1 



dhdhidkL 
2^r 



/ n ^^^^^ -^{h, d, K, 7) + iLkk - ^ 'Haux(/iLx, k) 
^ 1=1 \ 1=1 / 

J_ exp (^-ij\h) - i){{K, k) - i'LAhL 
\=i Z[i;^]Zm ntiZii^l^J 

Here we have omitted a normalising denominator that will be set to unity by the 0(1) 
saddle point constraints. Grouping together the n.,k and /laux terms we see we may 
rewrite flA.26p in the form 

« d 

nJ2p(d) / n^^'^^^^^aux^[^^^-^aux] 



i=l 



log y y —^^({h, K, k) exp log j d/iaux^(^aux, k] 



(A.27) 



with 



exp (-n{h, d, K, 7) - ^f^i #(/i) - ^f^i ^i{K, k) 
C{h, K, k) = ^ . . . (A.28) 

ulizmzm 

exp ( -?/aux(/iaux, d, k) + IKk - J^Ll ^aux(^aux) ) 

e(^aux, ^, k) = ^ L (A.29) 

11^=1 ^Faux] 

For L large the integral over k and k will be dominated by its saddle point. The saddle 
point conditions are 

= i/t (A.30) 



/ d/iaux [-i/(Aaux + 2m) +iK + id{hl^^Y/ (Aaux + 2i/t)2] exp U(^aux, /t, k) 

= ^ ^ ^. (A.31) 

Jdh 

aux 

exp ( ^{h 

aux ; ^ 7 ^ J 
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This gives us, using the functional v[. . ] defined in 

K = 0. (A.33) 

One subtlety here is that we need to make sure that the integral over k and k in flA.27p 
is evaluated accurately including 0(1) prefactors, because such contributions will feed 
into the 0{n) terms in the action. To obtain these prefactors one needs to expand the 
exponent of (]A.27p to second order about the saddle point. Fortunately the relevant 
Hessian of log ^ j d/iaux^(/iaux, z^, k)^ has a constant determinant, and the factor from 
the fluctuation correction is simply (27r/L). This just cancels the scaling factor in front 
of the integral over k and k. In this way one obtains for the 0{nL) contribution to 5*3 
the saddle point value of the exponent in (lA.27p . leading to f l44p in the main text, while 
the 0{n) contribution given in fj45|l comes from the prefactor j dh({h, k, k). 
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